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Abstract 

We develop an algorithm for reconstructing magnetic resonance images (MRI) from highly undersam- 
pled /c-space data. While existing methods focus on either image-level or patch-level sparse regularization 
strategies, we present a regularization framework that uses both image and patch-level sparsity constraints. 
The proposed regularization enforces image-level sparsity in terms of spatial finite differences (total 
variation) and patch-wise sparsity through in situ dictionary learning. We use the beta-Bernoulli process 
as a Bayesian prior for dictionary learning, which adaptively infers the dictionary size, the sparsity of 
each patch and the noise parameters. In addition, we employ an efficient numerical algorithm based on 
the alternating direction method of multipliers (ADMM). We present empirical results on several MR 
images, which show that the proposed regularization framework can improve reconstruction accuracy 
over other methods 1 

Index Terms 

magnetic resonance imaging, compressed sensing, dictionary leaming, Bayesian nonparametrics 

L Introduction 

Magnetic resonance (MR) imaging is a w^idely used non-invasive and non-ionizing technique for 
visualizing the anatomical structure and physiological functioning of the body. A limitation of MR 
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imaging is its slow scan speed during data acquisition, which is a drawback especially in dynamic 
imaging applications. Therefore, methods for speeding up the MRI process have received much research 
attention. Recent advances in signal reconstruction from measurements sampled below the Nyquist 
rate, called compressed sensing [1][2], have had a major impact on MRI [3]. CS-MRI allows for 
significant undersampling in fc-space, where each measurement is a Fourier transform coefficient, while 
still outputting a high-quality image reconstruction. MRI reconstruction using undersampled fc-space data 
is a case of an ill-posed inverse problem. However compressed sensing (CS) theory has shown that it 
is possible to reconstruct a signal from significantly fewer measurements than mandated by traditional 
Nyquist sampling if the signal is sparse in a particular transform domain. 

Using sparsity as a starting point, a large body of literature now exists presenting MRI reconstruction 
algorithms from significantly undersampled /c-space data. Existing improvements in CS-MRI mostly focus 
on (i) seeking sparse domains for the image, such as contourlets, etc. [5] [6]; (ii) using approximations of 
the £o norm for better reconstruction performance with fewer measurements, for example ^i, FOCUSS, 
ip quasi-norms with < p < 1, or using smooth functions to approximate the io norm [7]-[10]; and 
(in) accelerating image reconstruction through more efficient optimization techniques [35]. The modeling 
framework we present here is similarly motivated by these three objectives. 

CS-MRI reconstruction algorithms tend to fall into two categories: Those which enforce sparsity withing 
an image-level transform domain [3]-[16], and those which enforce sparsity on the patch-level (i.e., 
subregions of the image) [17]-[20]. Most CS-MRI reconstruction algorithms belong to the first category. 
For example Sparse MRI [3], the leading study in CS-MRI, performs MR image reconstruction by 
enforcing sparsity in both the wavelet domain and the total variation (TV) of the reconstructed image. 
Algorithms with image-level sparsity constraints such as Sparse MRI typically employ an "off-the-shelf" 
basis, which can usually capture only one feature of the image. For example, wavelets recover point-like 
features, while contourlets recover curve-like features. Since MR images contain a variety of underlying 
features, such as edges and textures, using a basis not adapted to the image can be considered a drawback 
of the algorithms in this group. 

Finding a basis that is suited to the image at hand while also preserving sparsity is key to successful 
MR image reconstruction. This is due to CS theory, in which it is shown that the required number of 
measurements is linked to the signal sparsity in the selected transform domain. A sparser representation 
requires fewer samples, thus allowing for faster MR imaging [1][2][3]. For reconstruction methods based 
on global image sparsity constraints, downsampling is limited to 2.5-3 fold [21]. Using a standard basis 
not adapted to the image under consideration will likely not provide a representation that can compete 
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in sparsity with an adapted basis. 

To this end, the second group of algorithms learn a sparse basis on image subregions called "patches" 
that is adapted to the image class of interest. One approach to adapted basis learning is dictionary learning. 
Recent studies in the image processing literature have shown that dictionary learning can find sparse 
representations of images on the patch-level [22]-[24], [33]. These algorithms learn a patch-level basis 
(i.e., dictionary) by exploiting structural similarities between patches extracted from images within a class 
of interest (for example BM3D [22], MOD [23] and K-SVD [24]). Among these approaches, adaptive 
dictionary learning based on patch-level sparsity constraints usually outperforms analytical dictionary 
approaches in denoising, super-resolution reconstruction, interpolation, inpainting, classification and other 
applications, since the adaptively learned dictionary suits the signals of interest [23]-[30]. 

Dictionary learning has been applied to CS-MRI as a sparse basis for reconstruction (e.g., LOST [18] 
and DLMRI [19]). Results using this approach demonstrate a significant improvement when compared 
with previous CS-MRI methods. However, these methods still have restrictions in that the dictionary size, 
patch sparsity and noise levels must be preset. In addition, algorithms such as dictionary learning that are 
based on only local image sparsity do not take into account additional image-level constraints, such as 
total variation, which can improve reconstruction. The work of Chen et al. [12] represents a step in this 
direction by using a comprehensive regularization framework that combines sparse dictionary learning 
with TV for CS-MRI. In [12], a dictionary is learned off-line on a separate set of images using K-SVD, 
and an i\ constraint on the patch reconstruction coefficients is employed for sparsity. A drawback of this 
approach is that off-line dictionary learning has been shown to have worse performance than learning 
dictionaries "on-line" directly on the image under consideration [24]. 

In this paper, we address the issues discussed above by proposing a new inversion framework for 
CS-MRI. Our work makes two contributions: 

1) We propose a combination of in situ dictionary learning and total variation as a sparsity constraint for 
the inverse CS-MRI problem. Dictionary learning is performed on-line within the MR image under 
consideration and finds a sparse image representation on the local level, while the total variation 
constraint controls image-level sparsity of finite differences. We use the alternating direction method 
of multipliers (ADMM) to derive an efficient optimization procedure [44]. 

2) We use a Bayesian approach to dictionary learning based on the beta-Bernoulli process [31]-[33]. 
This approach has three advantages: (i) it can learn the size of the dictionary from the data, (ii) it 
can leam the sparsity pattern on a patch-by-patch level, (Hi) it can adaptively leam regularization 
weights, which correspond to noise variance in the Bayesian framework. 



DRAFT 



4 



We organize the paper as follows. We review CS-MRI inversion methods and the beta process for 
dictionary learning in Section III In Section [Till we describe the proposed regularization framework 
and optimization algorithm. We then show the advantages of the proposed local/global regularization 
framework with nonparametric dictionary learning on several CS-MRI problems in Section [IVl 

II. Background and Related work 

In this section, we briefly review the problem of CS-MRI and the relevant approaches for MR image 
reconstruction. We then review a Bayesian method for dictionary learning called beta process factor 
analysis (BPFA), which we will employ in our inversion algorithm. 

We use the following notation: Let x G be a \fN x \fN image in vectorized form. Let J^u ^ C^^^, 
1^ < A^, be the undersampled Fourier encoding matrix and y = J^^x represents the sub-sampled set of 
fc-space measurements. The goal is to estimate x from the small fraction of fc-space measurements 3;. For 
dictionary learning, let Ri be the zth patch extraction operator. The operator Ri is a P x N matrix of all 
zeros except for a one in each row that extracts a vectorized \/P x \/P patch from the image, RiX G 
for i = 1, . . . , A^. We work with overlapping image patches with a shift of one pixel and allow a patch 
to wrap around the image at the boundaries for convenience [19] [28]. 

We focus on CS-MRI inversion via optimizing an unconstrained function of the form 

argmin h{x) + ^W^uX - y\\l, (1) 

X Z 

where — }^||2 is a data fidelity term, A > is a regularization parameter and h{x) is a regularization 
function that controls properties of the image we want to reconstruct. As discussed in the introduction, 
the function h can take several forms, but tends to fall into one of two categories according to whether 
image-level or patch-level information is considered. We next review these two approaches. 

A. MRI reconstruction using image-level sparse regularization 

MR image reconstruction from undersampled fc-space data with an image-level, or global regularization 
function hg(x) is one in which sparsity is enforced within a transform domain defined on the entire image. 
For example, in Sparse MRI [3] the regularization function is 

hg{x) = \\Wx\\i + fzTV{x), (2) 

where W is the wavelet basis and TV(x) is the total variation (spatial finite differences) of the image. 
Regularizing with this function requires that the image be sparse in the wavelet domain, as measured by 
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the £i norm of the wavelet coefficients which acts as a surrogate for £q [1][2]. The total variation 

term enforces homogeneity within the image by encouraging neighboring pixels to have similar values 
while allowing for sudden high frequency jumps at edges. The parameter fi > controls the trade-off 
between the two terms. 

Various other definitions of hg{x) have also been proposed for MRI reconstruction, for example 
over-complete contourlets [5], a combination of wavelets, contourlets and TV [6], and regularization 
of wavelet coefficient correlations based on Gaussian scale mixtures [4]. Other methods replace the ii 
norm with approximations of the Iq norm, for example FOCUSS [9] [10], ip norms [8], and homotopic 
io minimization [7]. 

Many numerical algorithms exist for optimizing ([TJ with an image-level hg{x), for example, nonlinear 
conjugate gradient descent with backtracking line search [3], an operator- splitting algorithm (TVCMRI) 
[11] and a variable splitting method (RecPF) [21]. Both TVCMRI and RecPF can replace iterative 
linear solvers with Fourier domain computations, with substantial time savings. Other methods in the 
literature include a combination of variable and operator splitting techniques [13], a fast composite 
splitting algorithm (FCSA) [35], a contourlet transform with iterative soft thresholding [5], a combination 
of Gaussian scale mixture model with iterative hard thresholding [4], a variation on Bregman operator 
splitting (BOS) [15] and alternating proximal minimization applied to the TV-based SENSE problem 
[16]. The above algorithms generally employ variable and operator splitting techniques with the EFT and 
alternating minimization to simplify the object function. In this work, we follow a similar approach. 

B. MRI reconstruction using patch-level sparse regularization 

An alternative to the image-level sparsity constraint hg{x) is a patch-level, or local regularization 
function /i/(x), which enforces sparsity in a transform domain defined on patches (square sub-regions of 
the image) extracted from the full image. An example of such a regularization function is. 



where the dictionary matrix is D ^ R^^^ and ai is a K-dimensional vector. An important difference 
between hi{x) and hg{x) is the additional function f{ai^D). While image-level sparsity constraints fall 
within a predefined transform domain, such as the wavelet basis, the sparse transform domain is typically 
unknown for patch-level regularization. Therefore, D and ai are unknown and must be learned, and the 
function / enforces sparsity by learning a D for which is sparse. Note that we suppress this dependence 
on a and D in hi{x). For example, [19] use K-SVD to learn D off-line, and then approximately optimize 




(3) 
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the objective function 

argmin \\RiX — Dai\\2 subject to ||Q;i||o < Vz, (4) 

i 

using orthogonal matching pursuits (OMP) [25]. (Note that this objective can be written using /(a^, D) = 
A^^||aJo for some /^^ > 0.) In this case, the extra parameters ai become part of the minimization in the 
objective function ([1]). From the squared error term in (@), we see that the io sparsity constraint gives 
an approximate sparse transform of x. Using this definition of hi{x) in Objective a local optimal 
solution can be found by an alternating minimization procedure: first solve the least squares solution for 
X using the current values of and D, and then update and D, or only if D is learned off-line. 

In addition to learning a sparse basis, the dictionary learning step can be thought of as a denoising 
procedure (we give more details in Section HFCl) . That is, the accumulation of each Dai in effect produces 
a denoised "proposal reconstruction" for x, after which the reconstruction takes into account the squared 
error from this smooth proposal and from the sub-sampled fc-space. 

Aside from sparse dictionary learning, other patch-level algorithms have been reported. For example, 
regularization of patches in a spatial region with a robust distance metric [17], patch clustering followed 
by de-aliasing and artifact removal for reconstruction using 3DFFT (LOST) [18] or directional wavelets 
[20]. These methods each take into account similarities between image patches. We next review our 
method for patch modeling through Bayesian nonparametric dictionary learning. 

C. Dictionary learning with beta process factor analysis 

Dictionary learning with the K-SVD requires a predefined dictionary size and the setting of either the 
sparsity level T, or an error threshold e to determine a patch- specific T^. In both cases, if the settings do 
not agree with ground truth, the performance will significantly degrade. To mitigate this problem, we use 
a Bayesian nonparametric method called beta process factor analysis (BPFA) [31], which has been shown 
to successfully infer both of these values, as well as have competitive performance with algorithms in 
several application areas [31]-[33], and see [45]-[48] for related algorithms. 

Being a Bayesian method, the prior definition gives a way (in principle) of generating images. Writing 
this generative method for BPFA gives an informative picture of what the algorithm is doing and what 
assumptions are being made. The result is a more transparent picture than could be given directly from 
the complicated functional form the prior takes in the objective function. To construct an image with 
BPFA, we use the hierarchical generative structure given in Algorithm [T] 
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Algorithm 1 Generating an image with BPFA 

1) Construct a dictionary D = [di^ . . . ,6?^]: 

dk^N{0,p-'lp), k = l,...,K. 

2) Draw a probability tt^ G [0, 1] for each dk'- 

TTk ~ Beta{ao^ 6o), = 1, . . . , i^. 

3) Draw precision values 7^ ~ Gamma{go^ ho) and 7^ ~ Gamma{eo, /o). 

4) For the ith patch in x: 

a) Draw the vector 5^ ~ A/" (0,77^ /i^). 

b) Draw the binary vector Zi with z^/^ ^ Bernoulli{Tik)' 

c) Define = 5^ o by an element-wise product. 

d) Construct the patch RiX = Dai + with noise ei ^ N{0^j~^lp). 

5) Construct the image x as the average of all RiX that overlap on a given pixel. 



With this approach, the model constructs a dictionary matrix D G of i.i.d. random variables, 

and assigns probability tt^ to vector dk- The parameters ao and &o for these probabilities are set such 
that most of the tt^ are expected to be small, with a few large; see [31] and the experiments section for 
more details. Each patch RiX extracted from the image x is modeled as a sparse weighted combination 
of the dictionary elements, as determined by the element- wise product of Zi G {0, 1}^ with the Gaussian 
vector Si. What makes the model nonparametric is that for many values of /c, there will be Zik = for 
all i; the model learns the number of these unused dictionary elements and their index values from the 
data. The independent Bernoulli random variables ensure values of zero for the kth element of each Zi 
when TTk is very small ("ensures" in the sense of having high probability), and thereby eliminates dk 
from the model. Therefore, the value of K should be set to a large number, more than the expected size 
of the dictionary. 

Though they are models for the same problem, BPFA and K-SVD have key differences. Aside from 
the nonparametric aspect of BPFA, another important difference is the ease with which the value of ||o 
can be inferred separately for each patch RiX, and the ability to perform inference on the noise variance 
parameter 7"^, which impacts the value of \\ai\\o in a similar way as the error threshold e impacts this 
cardinality for K-SVD. We use conjugate gamma prior distributions to infer the inverses of the variance 
parameters 7^ and 7^. We also note the absence of OMP from BPFA. 



DRAFT 



TABLE I 

Peak signal-to-noise ratio (PSNR) for image denoising with BFFA and K-SVD. BPFA and K-SVD have 

COMPARABLE PERFORMANCE WHEN THE NOISE PARAMETER OF K-SVD IS CORRECT (MATCH). BPFA OUTPERFORMS 
K-SVD WHEN THIS SETTING IS WRONG (MISMATCH). BPFA INFERS THE CORRECT NOISE LEVEL. 



Noise variance 


Noisy image (dB) 


K-SVD denoising 


BPFA denoising 


Match (dB) 


Mismatch (dB) 


Results (dB) 


Learned noise variance 


20 


22.13 


32.28 


28.94 


32.88 


20.43 


25 


20.16 


31.08 


28.60 


31.81 


25.46 


30 


18.58 


29.99 


28.35 


30.94 


30.47 



We briefly compare BPFA with K-SVD on a denoising problem. In these examples, we operate directly 
on the noisy image and not in fc-space. We consider noisy versions of an MR axial slice image of the 
brain (see Figure fT(a)] for an example). In Table U we show PSNR results for three noise variance levels. 



For K-SVD, we consider the case when the error parameter matches the ground truth, and when it 
mismatches it by a magnitude of five. As expected, when K-SVD does not have an appropriate setting 
of this value the performance suffers. BPFA on the other hand can adaptively infer the noise variance 
which leads to an improvement in denoising. 

In Figures [U and [2] we show example noisy images and reconstructions. In Figure [T(c)l we mark three 
locations where 6x6 patches are extracted to emphasize the ability of BPFA to adaptively learn more 
and less complicated regions. For example, in the simpler Region A, one dictionary element is necessary 
(a constant shift), since the patch is modeled well almost entirely by noise. In the more complicated 
Region C six dictionary elements are required for modeling. In Figure [2(a)] we show the denoised image. 



while in Figures |2(b)| and |2(c)[ we show reconstructions based on the number of dictionary elements 



required. The denoised image in Figure 2(a) is the sum of these two images. 



III. CS-MRI WITH BAYESIAN dictionary learning and TV PENALTY 

We next present our regularization scheme for reconstructing MR images from highly undersampled 
fc-space data. In reference to the discussion in Section |II1 we propose a combination of local and global 
sparsity constraints as follows: 

argmin Xghg{x) + hi{x) + ^\\J='uX - y\\l, (5) 
hg{x) - TV{x), hi{x) - 5] ^||i?ix - DaiWl + /(^). 
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(a) Noise variance equals 20. 



(b) Noise variance equals 30. 



(c) Three patch regions. 



Fig. 1. Example noisy images and three regions of varying patch sparsity. Patch A required 1 dictionary element (constant 
shift), patch B required 10 and patch C required 6 as inferred by the algorithm. We use a patch size of 6 x 6 for all experiments. 




(a) Denoising by BPFA 



(b) Reconstruction using ||ai||o > 4. (c) Reconstruction using \\ai\\o < 4. 



Fig. 2. (a) Image denoised with BPFA for noise variance equal to 20. (b) The reconstructed image using only patches requiring 
greater than four dictionary elements, (c) Reconstruction using patches requiring less than five dictionary elements. Higher 
frequency detail is captured with more dictionary elements. 



For the local regularization function hi{x) we use BPFA as given in Algorithm [T] in Section Hl-CI The 
parameters to be optimized for this penalty are contained in the set (fi = {D^Si^Zi^je^s^^}, and are 
defined in Algorithm [T] The regularization term 7^ is a model variable that corresponds to an inverse 
variance parameter of the multivariate Gaussian likelihood. This likelihood is equivalently viewed as the 
squared error penalty term in Objective (|5). We indicate how to construct the analytical form of / in 
the appendix. This term acts as the sparse basis for the image and also aids in producing a denoised 
reconstruction, as discussed in Section HFCl For the global regularization function hg{x) we use the total 
variation of the image. This term encourages homogeneity within contiguous regions of the image, while 
still allowing for sharp jumps in pixel value at edges due to the underlying ii penalty. The regularization 
parameters A^, 7^ and A control the trade-off between the terms in this optimization, which is adaptively 
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learned since 7^ changes with each iteration. 

For the total variation penalty TV{x) we use the isotropic TV model. Let be the 2 x difference 
operator for pixel i. Each row of contains a 1 centered on pixel i, and —1 on the pixel directly above 
pixel i (for the first row of tpi) or to the right (for the second row of tpi), and zeros elsewhere. Let 
"if = [^f , . . . , ^^]^ be the resulting 2N x N difference matrix for the entire image. The TV coefficients 
are /3 = ^^x E R^^, and the isotropic TV penalty is TV(x) ^ J2i\\i^i^h = J2i \j ^li-i + l^lv where i 
ranges over the pixels in the MR image. Several algorithms have been proposed for TV minimization, 
for example using Newton's method [49] or graph cuts [50]. Recently, a simple and efficient method 
based on the alternating direction method of multipliers, also called the split Bregman method, has been 
proposed for TV denoising models [42]. We adopt this approach in our optimization algorithm. 

A. Alternating Direction Method of Multipliers 

The alternating direction method of multipliers (ADMM) is a general algorithmic approach to convex 
optimization [43]. For our model, ADMM works by performing dual ascent on the augmented Lagrangian 
objective function introduced for the total variation coefficients. Though our overall objective is not convex 
due to the dictionary learning terms, we note that when holding the dictionary learning parameters fixed, 
the resulting TV denoising problem for which we use ADMM is convex. 

To review the general form of ADMM we are interested in, we start with the convex optimization 
problem 

min ||A^ — 6II2 + (6) 

X 

where /i is a non-smooth convex function, such as an ^1 penalty. ADMM decouples the smooth squared 
error term from this penalty by introducing a second vector v such that 

min ll^x — 6II2 + /i(v) subject to v = x. (7) 

X 

This is followed by a relaxation of the equality v = x via an augmented Lagrangian term 

L(x, V, J]) = \\Ax -h\\l + h{v) + rf{x - v) + ^ ||x - v\\l (8) 

A minimax saddle point is found with the minimization taking place over both x and v and dual ascent 
for 77. The solution for x of the relaxed problem in dSj is the same as that of ^ [43]. 

Another way to write the objective in dS) is to define u = {l/p)r] and combine the last two terms. The 
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result is an objective that can be optimized by cycling through the following updates for x, v and w, 

X = argmin ||Ax-6||2 + V + WII2, (9) 

v' — argmin /i(v) + -llx' — v + w||2, (10) 
V 2 



- w+y-v'. (11) 



u 



This algorithm simplifies the optimization since the objective for x is quadratic and thus has a simple 
analytic solution, while the update for v is a proximity operator of h with penalty p, the difference being 
that V is not pre-multiplied by a matrix as x is in (|6]). Such optimization problems tend to be much easier 
to solve; for example when h is the TV penalty the solution for v is analytical. See Boyd, et al. [43] for 
a detailed review of the ADMM algorithm, where (|6]) is one specific application they consider. 

B. Algorithm 

We present an algorithm for finding a local optimal solution to the non-convex objective function given 
in ([5]). We re-write this objective as 

L(x, ^) = Il^.-^lb + ^ - DoL,\l + E. /(^z) + \ W^ux - y\l (12) 

We seek to minimize this objective with respect to x and the set ^pi = {i?, Si^Zi^^e^ Ts, 

We begin by defining the TV coefficients for the ith pixel as := [/32i-i /52i]^ = i^iX. We introduce 
the vector of Lagrange multipliers 77^, and then split f3^ from by relaxing the equality via an augmented 
Lagrangian. This results in the objective function 

L(x,/3,7/,(^) = A,E^ll^zll2 + r/f(^,x-^J + f||^,x-/3j2 

+ J:^^\\R^^-Da,g + f{^,) 

+ ^ll^u^-3^ll2- (13) 

From the ADMM theory, this objective wi 

tisfie(J^ 



so the equality constraints will be satisfiec 



1 have (local) optimal values /3* and x* with /3* = ^^x*, and 
As written in ([T3l) , optimizing this function can be split into 
three separate sub-problems: one for /3^, one for (pi = {D^Si^Zi^je^js^^} and one for x. Following the 
discussion of ADMM in Section UlI-AL we define Ui = {l/p)r]i and complete the square in the first line 
of ([T3]) . We then cycle through the following three sub-problems, 

^We note that for a fixed D and ai-.N, the solution is also globally optimal. 
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Algorithm 2 Outline of algorithm 

Input: y. undersampled fc-space data 
Output: x\ reconstructed MR image 

Step 1. Initialize x = J^^y (zero filling), and w = 0. Initialize BPFA variables using x. 

Step 2. Solve PI sub-problem by optimizing /3 via shrinkage. 

Step 3. Update P2 sub-problem by Gibbs sampling BPFA variables. 

Step 4. Solve P3 sub-problem in Fourier domain, followed by inverse transform. 

Step 5. Update Lagrange multiplier vector u. 

if not converged then return to Step 2. 



(PI) f3[ = argmin^ A^||^||2 + f||^,x-^ + w,||i, z = l,...,7V, 
(P2) ip' = argmin^ Y.i^\\Ri^ - Dai\\l^ f{ipi), 

(P3) x' = argmin^ Y.^^U^^ - ^^\\l^Y.^i\\R^^ - D' a[^^^^ 

u'- Ui-\- ipix' - /3i, z = 1, . . . , TV. 

For each sub-problem, we use the most recent values of all other parameters. Solutions for PI and 
P3 are globally optimal and in closed form, while the update for Ui follows from the review of ADMM 
in Section IIII-A[ Since P2 is non-convex, we cannot perform the desired minimization, and so an 
approximation is required. Furthermore, this problem requires iterating through several parameters, and 
so a local optimal solution cannot be given in closed form either. Our approach is to use stochastic 
optimization for problem P2 by Gibbs sampling each variable in BPFA conditioned on current values 
for all other variables. We next present the updates for each sub-problem. 

1) Algorithm for PI sub-problem: We can solve for exactly for each pixel z = 1, . . . , by using 
a generalized shrinkage operation [43], 

f3[ = max(||^,x + ^.Ib - ^,o) • (14) 

[ p J \\iJiX + Ui\\2 

We recall that corresponds to the 2-dimensional TV coefficients for pixel z, with differences in one 
direction vertically and horizontally. These coefficients have been been split from using ADMM, but 
gradually converge to one another and become equal in the limit. We recall that after updating x, we 
update the Lagrange multiplier = + ipix' — j3'-. 

2) Algorithm for P2 sub-problem: We update the parameters of BPFA using Gibbs sampling. We are 
therefore stochastically optimizing ([T3l) , but only for this sub-problem. With reference to Algorithm [T] 
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for BPFA, the P2 sub-problem entails sampling new values for the dictionary D, the binary vectors zi 
and weights si, with which we construct ai ^ siozi through the element- wise product, the precisions 7^ 
and 75, and the beta probabilities tii:k^ which give the prior probability that zik = 1. In principle, there 
is no limit to the number of samples that can be made, with the final sample giving the updates used 
in the other sub-problems. We found that a single sample is sufficient in practice and leads to a faster 
algorithm. The samples we make are given below. 

a) Sample D: We define the P x matrix X = . . . , i?Arx], which is the matrix of all 
vectorized patches extracted from the image x. We also define the K x matrix ol = [ai^ . . . ^a^] 
containing the dictionary weight coefficients for the corresponding columns in X such that Dol is an 
approximation of the mean of X prior to additive Gaussian noise. The update for the dictionary D is 

D = Xa^(aa^ + (P/7,)/p)-i + £;, (15) 

Ep, 7V(0,(7.aa^ + P/p)-'), p=l,...,P. 

We note that the first term in Equation (ITSl) is the ^2 -regularized least squares solution for D. To this is 
added Gaussian noise that is correlated across dictionary elements. Since both the number of pixels and 
7^ will tend to be very large, the variance of the noise is small and the mean term dominates the update 
for D. 

b) Sample sik and Zi^: We sample the values for weights Sik and indicators Zik as a block. We recall 
that to block sample two variables from their joint distribution, (5, z) ^ p{s, z), one can first sample z 
from the marginal distribution, z ^ p{z), and then sample s\z p{s\z) from the conditional distribution. 
The other sampling direction is possible as well, but for our problem sampling z ^ s\z is more efficient 
in finding a mode of the objective function. 

We define _/e to be the residual error in approximating the ith patch with the current values from 
BPFA minus the kth dictionary element, _/e = RiX — J2j^k(^ij^ij)^j- We then sample Zik from a 
Bernoulli distribution Zik ^ Pik^i + (1 — Pik)^o where, following a simplification of terms, 

Pik 7Tk(^^ + ^d^dk^ exp |y -/c)^/(7s/7£ + , (16) 

1-pik OC l-7Tk. (17) 

We observe that the probability that Zik = 1 takes into account how well dictionary element dk correlates 
with the residual Vi^^k- After sampling Zik we sample the corresponding weight Sik from its conditional 
posterior Gaussian distribution, 

Sik ^ N {zikdlri_k/{ls/le + dldk), (7^ + ^eZikdkdk)~^) . (18) 
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When = I, the mean of Sik is the regularized least squares solution and the variance will be small 
if 7^ is large. When Zik = 0, Sik is sampled from the prior, but we note that the value of Sik does not 
factor into the model in this case, since SikZik = and Sik is integrated out when Zik is next sampled. 

c) Sample 7^ and 75.* We next sample from the conditional gamma posterior distribution of the 
noise precision and weight precision, 

7^ - Gamma{go + lPN,ho + lJ2i\\Ri^-Daig), (19) 

7^ - Gamma{eQ + ^ J2ik ^iki /o + ^ J2ik ^ik^ik)- (20) 

The expected value of each variable is the first term of the distribution divided by the second. We note 
that this is close to the inverse of the average empirical error for 7^. 

d) Sample Hk' The conditional posterior of is a beta distribution, and we sample it as follows, 

TT/e Beta (ao + Y.i ^ik, bo + ^^(1 - Zik)) . (21) 

The parameters to the beta distribution are essentially counts; the first parameter includes the number 
of times dictionary element d^ was used by the patches, and the second term includes a count of the 
number of times a patch did not use d^. 

3) Algorithm for P3 sub-problem: The final sub-problem is to reconstruct the image x. The corre- 
sponding objective function is 

X = arg mjn ^ - f^i + ^iWl + W^i^ - Dai\\l + ^W^u^ - yWl- 

i i 

Since this is a least squares problem, x has a closed form solution that satisfies 

(p^T^ + 7. RIR^ + A^^^n) X = p^^{/3 -u) + 7.i^^BPFA + AJ-f 3^. (22) 

We recall that is the matrix of stacked The vector /3 is also obtained by stacking each /3^, and 
similarly u is the vector formed by stacking w^. The vector Xbpfa is the proposed reconstructed image 
from BPFA using the current D and ai:N, which results from the equality Pxbpfa = J2i RfDc^i. 

We observe that inverting the left N x N matrix is computationally prohibitive, since N is the number 
of pixels in the image; for example we consider images that have N = 256^ pixels. Fortunately, given the 
form of the matrix in Equation (122)) we can simplify the problem by working in the Fourier domain, which 
allows for element- wise updates in fc-space, followed by an inverse Fourier transform. We represent x as 
X = T^O, where 9 is the Fourier transform of x and the superscript H denotes the conjugate transpose. 
We also take the Fourier transform of each side of Equation (l22l) to give 

jr (^^T^ ^ ^T^^ ^ XT^Tu) T^^e = pT^^{^ -u)^ 7.^^^BPFA + XTT^y. (23) 
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The left-hand matrix simplifies to a diagonal matrix, 

T [p^^^ + 7. + y^^u^u) = M + lePlN + A/]^. (24) 

Term-by-term this results as follows: The product of the finite difference operator matrix ^ with itself 

yields a circulant matrix, which has the rows of the Fourier matrix T as its eigenvectors. It follows that 

T^^^T^ diagonalizes and produces a matrix of eigenvalues A. The matrix R^Ri is a matrix of 

all zeros, except for ones on the diagonal entries that correspond to the indices of x associated with the ith 

patch. Since each pixel appears in P patches, the sum over i gives P/at, and the Fourier product cancels. 

The final diagonal matrix /]^ also contains all zeros, except for ones along the diagonal corresponding 

to the indices in fc-space that are measured, which results from TT^ Tu^^ - 

Since the left matrix is diagonal following the left and right Fourier matrix multiplications, updating 

the Fourier coefficients d oix becomes a set of one-dimensional problems. It follows that the Fourier 

transform of the optimal x for sub-problem P3 is 

. ^ - ^) + -iePTjX^^^^ + XTjT^y 

pk,,^^,P^\T,T^\ ' ^^^^ 

In the numerator, we observe that if i is not a measured fc-space location, the right-most term equals 

zero; otherwise it equals the measurement in fc-space. In the denominator, using a vector of ones 1, the 

right-most term will equal A if z is a measured fc-space location, and zero otherwise. We invert Q via the 

inverse Fourier transform to obtain the reconstructed MR image x^ 

IV. Experiments and Discussion 

We present experimental results on synthetic and MRI data using the proposed CS inversion algorithm 
given in Section IIII-B[ We consider a variety of sampling rates and masks, and compare with two other 
algorithms: SparseMRI [3] and FCSA [35]. For these algorithms, we use publicly available code with 
their built in parameter settings from the authors' websites. We also compare with BPFA without the 
total variation regularization, which is a special case of our algorithm with A^ = 0. All the experiments 
are implemented with Matlab 2010 and run on an Intel Core CPU at 2.8G and 4G memory. 

A. Set-up 

We use three sampling trajectories in fc-space: Cartesian sampling with random phase encodes, random 
sampling and radial sampling. In the latter scheme, samples are taken on a Cartesian grid at the points 
nearest to radial lines uniformly spaced in angle. We show examples of these trajectories in Figure |3] 
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Fig. 3. Example masks used to sample /c-space with an acceleration speed of 4. (left) Cartesian mask, (middle) random mask, 
(right) radial mask. 

We considered several subsampling rates for each trajectory, measuring 10%, 20%, 25%, 30%, and 35% 
of fc-space. As a performance measure we use the peak signal-to-noise ratio (PSNR) to the ground truth 
image, in addition to qualitative performance comparisons. 

For all images considered, we extract 6x6 patches where each pixel defines the upper left comer of 
a patch; we wrap around the image at the boundaries. For the synthetic data we learn complex-valued 
dictionaries, while for the MR data we restrict the model to real-valued dictionaries. We initialize the 
reconstruction x using zero-filling in fc-space. We use a dictionary with K = 128 initial dictionary 
elements, recalling that the final number of dictionary elements will be smaller due to the sparse BPFA 
prior. We randomly initialized the dictionary elements by sampling from the prior given in Algorithm [1] 
We ran 200 iterations of the algorithm, which we observed was sufficient for convergence to within a 
reasonable threshold according to the PSNR. 

For regularization parameters, we set the data fidelity regularization A = 10^, the total variation 
regularization = 100 and the ADMM parameter p = 1000 for the BPFA and BPFA-kTV algorithms. 
The results were not very sensitive around these values, but did begin to degrade when they deviated 
far from this setting. We set ao = 1/A^ and ^ 1 — 1/K io encourage sparsity and approach the beta 
process in the limit. We set the remaining parameters cq = do = = /o = 1. For image reconstructions 
using BPFA and BPFA-i-TV, we use the denoised reconstruction from the dictionary, Xbpfa, rather than x. 
For small-noise images, we did not observe much difference between the two, but for noisy images we 
found that Xbpfa produced a better denoised image, as can be expected given the discussion in Section HFCl 
This is an advantage of dictionary learning based algorithms that is not available with an off-the-shelf 
basis as use by most inversion algorithms. 
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(a) sampling mask (b) original image 



(c) zero-filling 



O BPFA 
— * — BPFA+TV 
SparseMRI 











(d) SparseMRI 



(e) BPFA+TV 





(f) Cartesian sampling 



(g) random sampling 



(h) radial sampling 



Fig. 4. (top) Example noiseless phantom image reconstruction with 15% sampling rate and Cartesian sampling, (bottom) 
Reconstruction PSNR as a function of percentage sampled in /c-space for noiseless phantom and different sampling trajectories. 




(a) sampling mask (b) noisy original 



ft) 



(c) zero-filling 




(d) SparseMRI 



(e) BPFA-hTV 






(f) Cartesian sampling 



(g) random sampling 



(h) radial sampling 



Fig. 5. (top) Example noisy phantom image reconstruction with 35% sampling rate and Cartesian sampling, (bottom) 
Reconstruction PSNR as a function of percentage sampled in /c- space for phantom image with additive noise having standard 
deviation a = 0.03. 
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(b) zero-filling 
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(e) learned dictionary 



(f) dictionary histogram 



(g) patch histogram 



Fig. 6. (a)-(d) Reconstruction of GE data with additive white noise (a = 0.01) and 30% /c-space Cartesian sampling, (e) 
learned dictionary, (f) a histogram of the number of times each dictionary element was used, (g) a histogram of how many 
dictionary elements each patch used. 



B. Experiments on simulated data 

We first consider noiseless and noisy simulated images. We show results for the 256 x 256 noiseless and 
noisy phantom image in Figures |4] and |5] respectively. The top row of Figure |4] shows the original image 
and reconstructions for 15% Cartesian sampling. The bottom row shows PSNR for the three sampling 
trajectories over several sampling rates. Figure shows similar results on the phantom image with 
additive Gaussian noise having standard deviation a = 0.03. In these examples, we see that BPFA+TV 
significantly improves on zero-filling (taking the missing fc-space values to equal zero), and improves 
upon SparseMRI as well. Since the available code for FCSA is not written for complex data, comparison 
with SparseMRI is provided only. In addition to removing noise, the BPFA-based models were able to 
significantly reduce the ringing that results from the missing Cartesian trajectories. Noisy results with 
various other settings of a were consistent with the results shown. 

We also use a simulated GE phantom to evaluate the dictionary learning done by BPFA. In Figure [6] 
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we show the noisy GE image, which we sample at 30% in fc-space with Cartesian sampHng. The initial 
number of dictionary elements is 128, but we observe that the model learns an overcomplete bases of 
105 dictionary elements, with the remaining elements removed from the model. The dictionary elements 
are given in Figure |6^. We also show patch statistics, where we see that no patch used more than 1 1 
dictionary elements. 

C. Experiments on MRI 

We next present our experiments on five MR images: the circle of Willis (Figure |7]), lumber (Figure 
[8]), shoulder (Figure |9]), brain (Figure [TOj and coronal (Figure [n). In each pair of figures, we show an 
example reconstruction for qualitative evaluation as well as PSNR results for each sampling trajectory 
and rate. In the qualitative results, we zoom in on a region of the image for better comparison of the 
reconstructions with the original. 

In general, we observe that BPFA+TV learned smoother reconstructions than FCSA. We particularly 
note the ringing effect in the shoulder example (Figure |9j for FCSA and SparseMRI, which is absent from 
the BPFA+TV result. This is similar to the observed results for the synthetic phantom image shown in 
Figure m Reconstruction of the circle of Willis, brain and coronal images also has significantly more noise 
for FCSA than BPFA+TV. The PSNR results consistently show that the BPFA-based models outperform 
the comparison models. We also note that, in general, adding the total variation penalty slightly improves 
the reconstruction when compared with the base BPFA algorithm. Since the fraction of the per-iteration 
running time used for TV minimization is very small, there is no trade-off in adding this penalty. 

Consistent with other MRI research results, the best performance was obtained with pseudo-random 
sampling, while the worst was obtained with Cartesian sampling. Since Cartesian is the most practical 
approach, however, results such as that given in Figure |9] are particularly encouraging and highlight 
the usefulness of the BPFA approach. Roughly speaking, the per-iteration running time for all images 
decreased from 15 to 7 seconds as the sparsity of BPFA became more dominant with each iteration. 

We also evaluate in more detail the performance of BPFA for MRI inversion in Figure [12] using the 
brain image with 35% Cartesian sampling in fc-space. First, we consider the value of sampling the BPFA 
regularization term 7^ during each iteration, which we refer to as adaptive learning. We compare sampling 
this value with fixing it in advance to several different values. In Figure [T2^ . we show PSNR results 
for these cases, where we see that adaptively learning 7^ improves the reconstruction results. As seen in 
Figure [T2h . we consider fixing 7^ to the range of values covered during the 200 iterations of adaptive 
learning. Image residuals are shown in Figures [T2k-e. including the residual from zero filling. 
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(a) original 




(b) BPFA+TV 



(c) FCSA 



(d) SparseMRI 




(e) Cartesian sampling 



(f) random sampling 



(g) radial sampling 



Fig. 7. (top) Reconstruction of Circle of Willis image with 20% random sampling in /c-space. The inset shows a region in 
more detail, (e)-(g) Reconstruction PSNR as a function of percentage sampled in /c-space for various trajectories. 




(e) Cartesian sampling 



(f) random sampling 









T 
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(d) SparseMRI 



(g) radial sampling 



Fig. 8. (top) Reconstruction of lumbar image with 30% Cartesian sampling in /c-space. The inset shows a region in more 
detail, (e)-(g) Reconstruction PSNR as a function of percentage sampled in A;-space for various trajectories. 
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(a) original 



(b) BPFA+TV 



(c) FCSA 



(d) SparseMRI 



BPFA 

— * — BPFA+TV 
— \ — FCSA 
SparseMRI 












(e) Cartesian sampling 



(f) random sampling 



(g) radial sampling 



Fig. 9. (top) Reconstruction of shoulder image with 35% Cartesian sampling in /c-space. The inset shows a region in more 
detail, (e)-(g) Reconstruction PSNR as a function of percentage sampled in /c-space for various trajectories. 
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(c) FCSA 
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(e) Cartesian sampling 



(f) random sampling 



(g) radial sampling 



Fig. 10. (top) Reconstruction of brain image with 20% radial sampling in /c-space. The inset shows a region in more detail. 
(e)-(g) Reconstruction PSNR as a function of percentage sampled in /c-space for various trajectories. 
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(a) original 



(b) BPFA+TV 



(c) FCSA 
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-{ — FCSA 
SparseMRI 



(d) SparseMRI 




(e) Cartesian sampling 



(f) random sampling 



(g) radial sampling 



Fig. 11. (top) Reconstruction of coronal image with 20% random sampling in /c-space. The inset shows a region in more 
detail, (e)-(g) Reconstruction PSNR as a function of percentage sampled in /c-space for various trajectories. 



As with the total variation penalty, sampling 7^ requires virtually no running time. Therefore, there is 
no trade off in this regard. However, we note that there was a significant difference in running time, with 
adaptive learning require roughly half the time required when setting 7^ = 5 x 10^. This is because in the 
early iterations, the dictionary is being learned on a more complicated, less smooth x since we initialize 
with zero filling. A large value of 7^ will require much more of the dictionary since BPFA will try to fit 
more complicated patches to much higher precision. By adaptively learning 7^ and allowing this value to 
grow as in Figure [T2h . the model is starting from a less confident approximation and gradually becoming 
more confident as the reconstructed x stabilizes. Since 7^ is also the regularization parameter for the 
dictionary learning contribution to the reconstruction of x, we see that BPFA will have more influence in 
the reconstructed x as the number of iterations increases. We also note that the values of 7^ also should 
be used to motivate the setting of A and A^. The update for 7^ is intuitive, being roughly the inverse of 
the squared error, and could also be employed by other algorithms such as K-SVD, but we note that the 
Bayesian approach gives a principled framework for making such updates to the regularization parameter. 

We also show the dictionary learned in Figure [T2f and the dictionary and patch histograms in Figures 
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(a) zero filling (b) 7^ = 10^ (c) 7e = 3 x 10^ (d) 7e = 5 x 10^ (e) adaptive 7^ 




(f) learned dictionary (g) PSNR v.s. iteration 




(h) 7e v.s. iteration (i) dictionary histogram (j) patch histogram 



Fig. 12. Example of BPFA+TV model results for the brain image with 35% Cartesian sampling in /c-space. We focus on 
comparing the learning of 7e with the pre-setting of this parameter, (a)-(e) show the residuals for different settings, (g) Shows 
the PSNR as a function of iteration for the three settings and adaptive 7e. (h) Shows the value of 7e for the adaptive case. In 
(f), (i) and (j) we show statistics from the BPFA model similar to Figure |6](e) -(g). 

[T2l-j. We have sorted the dictionary elements by the number of times they were used by the patches. We 
observed that at the first iteration, the 128 dictionary elements were used equally by the model. As the 
number of iterations increased, the sparsity encouraging mechanism of BPFA shrunk this usage to about 
108 elements, which were not equally used among the patches. The patch histogram shows that the each 
patch used on average 15 elements for reconstruction, with a second mode around one or two elements 
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for the smooth parts of the image. 



V. Conclusion 



We have presented an algorithm for CS-MRI reconstruction that incorporates total variation mini- 
mization with Bayesian dictionary learning. Our Bayesian approach uses a nonparametric model called 
beta process factor analysis (BPFA) for in situ dictionary learning. Through this hierarchical generative 
structure, we can learn the dictionary size, sparsity pattern and additional regularization parameters. 
We presented an efficient optimization algorithm using the alternating direction method of multipliers 
(ADMM) and MCMC Gibbs sampling for all BPFA variables. Experimental results on several MR images 
showed that our proposed regularization framework compares favorably with other algorithms for various 
sampling trajectories and rates. 

VI. Appendix 

We give some additional details of the Bayesian structure of our dictionary learning approach. The 
unknown variables of the model are D = {di, . . . , dx}, tv = {tti, . . . , ttk}, {si}i=i:N, {zi}i=i:N, Te, Is- 
The "data" from the perspective of BPFA is the set of patches extracted from the current reconstruction, 
{Rix}i=i:N. The joint likelihood of these variables and data is 

p{{RiX}, D, TT, {Si}, {Zi}, le^ls) = 



N 



.1=1 



K 



k=l 



P{le)p{ls)- (26) 



The first group constitutes the patch- specific part of the likelihood. The second group contains the 
dictionary elements and their probabilities and the remaining distributions are for inverse variances. The 
specific distributions are given in Algorithm [1] from which the functional form of the joint likelihood can 
be obtained. The dictionary learning part of the objective, also referred to as subproblem P2, is 

^ y \\RiX - DaiWl + /(^i) = - lnp({i?ix}, D, tt, {^J, {zJ, 76,75). 

i 

Optimizing this non-convex function is equivalent to finding a mode of the joint likelihood. Rather than 
use a deterministic gradient-based method, we use the MCMC Gibbs sampling method from statistical 
inference to stochastically find a mode. The functional form of the log joint likelihood is not very insightful 
and is unnecessary for deriving the Gibbs sampling algorithm. We note that many of the updates are 
essentially solutions to least squares problems, and so are intuitive from an optimization perspective. 
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